Skip to content

perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417)#448

Merged
trilamsr merged 1 commit into
mainfrom
perf/417-xid-correlation-allocs
Jun 2, 2026
Merged

perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417)#448
trilamsr merged 1 commit into
mainfrom
perf/417-xid-correlation-allocs

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Lowers BenchmarkXidCorrelationDetector allocs/event from 12.40 → 1.88 (a 6.6x reduction), hitting the NORTHSTAR ≤2 target. Ratchets the absolute-ceiling gate from 13 → 2 and refreshes the baseline.

Follows the recipe established by PR #438 (nccl_hang: 3.99 → 1.70) and extends it with a composite-buffer + unsafe.String aliasing pass, which is load-bearing here because xid_correlation emits one verdict per evicted pod (896 verdicts / 1024-event bench window) vs nccl_hang's ~64 verdicts — per-verdict allocation cost dominates allocs/event.

Root cause

Per-verdict the detector built five variable-length prose strings (Headline, Remediation, xid-evidence UID, pod-evidence Description, plus EvictedPod from displayPodName's ns + "/" + name concat). Four of those used fmt.Sprintf, each costing one interface-boxing allocation per integer/string argument plus one result-string allocation. At 896 verdicts × ~11 allocs/verdict + a growing verdicts slice + a default-sized indexXidsByNode map, the measured floor was 12699 allocs/op.

Approach

  1. Composite-buffer carve (the load-bearing change). buildXidCorrelationVerdict now pre-computes the rendered length of every variable-length string, allocates ONE []byte of exactly that total size, appends each field into a known offset range, then carves immutable string headers out via unsafe.String(&buf[lo], hi-lo). The buffer is finalized before any string is carved, so the aliasing is sound by construction. Per-verdict allocation floor drops from 6 to 2 (the composite buffer + the 2-element EvidenceTrail backing array).
  2. Pre-size join structures. indexXidsByNode map is sized to len(xids); the verdicts slice is sized to len(events).
  3. strconv-based int rendering via stack-allocated [20]byte scratch arrays + strconv.AppendInt (no fmt.Sprintf, no strconv.Itoa's allocation).
  4. Strings.Builder fallback for xidEvidenceUID and xidEvidenceDescription (shared with hbm_ecc.go).

The unsafe.String aliasing trick is documented inline; the invariant ("buffer is immutable post-carve") is enforced by construction in buildXidCorrelationVerdict (no further append after the carve block).

Before / after

Bench: go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 -benchtime=500ms ./bench/detectors/ on M1 Max.

Metric Before After Delta
allocs/op 12699 1928 −85%
B/op 907 KiB 597 KiB −34%
allocs/event 12.40 1.88 hits NORTHSTAR (≤2)

Profile entries fixed (per pprof -sample_index=alloc_objects):

  • xidCorrelationHeadline fmt.Sprintf → composite-buffer carve
  • xidCorrelationRemediation fmt.Sprintf → composite-buffer carve
  • xidEvidenceUID fmt.Sprintf → composite-buffer carve
  • pod-event Description fmt.Sprintf → composite-buffer carve
  • displayPodName ns + "/" + name concat → folded into composite buffer
  • indexXidsByNode map → pre-sized to len(xids)
  • verdicts slice grow chain → pre-sized to len(events)

Residual per-verdict allocations (the two unavoidable ones at NORTHSTAR floor):

  1. Composite prose buffer (1 alloc, ~660 B holding all 5 strings).
  2. EvidenceTrail 2-element backing array (1 alloc, escapes on the returned verdict).

Semantics

Golden-replay byte-identity preserved: module/pkg/replay/xid_correlation/canonical/golden.json passes unchanged. The unsafe.String-carved strings alias an immutable buffer; JSON serialization compares content equality, not pointer identity.

Files changed

  • module/pkg/patterns/xid_correlation.go — composite-buffer rewrite of buildXidCorrelationVerdict; pre-sized join maps; dead Sprintf-era renderers removed.
  • bench/detectors/baselines.json — refreshed BenchmarkXidCorrelationDetector entry to the new floor.
  • scripts/bench-registry.sh — ceiling 13 → 2; comment updated to record the ratchet and NORTHSTAR-hit.

Test plan

  • go test -race ./pkg/patterns/... ./pkg/replay/... ./processor/... -count=1
  • go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 ./bench/detectors/ — stable at 1928 allocs/op
  • ./scripts/bench-allocs-check.sh — every detector at-or-below ceiling
  • ./scripts/bench-check-detectors.sh — within 10% of baseline
  • go vet ./..., golangci-lint run ./... — clean (pre-commit hook)
  • module/pkg/replay/xid_correlation/canonical/golden.json — byte-identical replay

Sibling work

perf: xid_correlation detector allocations dropped from 12.4 to 1.88 per event (hits the ≤2 NORTHSTAR target). Verdict output is byte-identical.

Closes #417

Pack all five variable-length prose strings per verdict (EvictedPod,
Headline, Remediation, xid-evidence UID, pod-evidence description)
into a single composite byte buffer, then carve immutable string
headers out of that buffer via unsafe.String aliasing. Combined with
the 2-element EvidenceTrail backing array, the per-verdict floor is
2 allocations regardless of input scale. Detector hits NORTHSTAR.

Recipe mirrors PR #438 (nccl_hang): pre-size join maps and verdicts
slice; replace fmt.Sprintf with strconv.AppendInt-based builders;
fold per-verdict integer formatting onto stack scratch arrays. Adds
a composite-buffer + unsafe.String pass that #438 did not need
(nccl_hang emits ~64 verdicts vs xid_correlation's 896 per bench
pass -- per-verdict cost dominates here).

Bench (bench/detectors/, count=5, M1 Max):
  before: 12699 allocs/op, 907 KiB/op, 12.40 allocs/event
  after:   1928 allocs/op, 597 KiB/op,  1.88 allocs/event
  delta:  -85% allocs, -34% bytes; hits NORTHSTAR (<=2)

Hotspots eliminated (per memprofile alloc_objects):
- xidCorrelationHeadline    fmt.Sprintf -> composite-buffer carve
- xidCorrelationRemediation fmt.Sprintf -> composite-buffer carve
- xidEvidenceUID            fmt.Sprintf -> composite-buffer carve
- pod-event Description     fmt.Sprintf -> composite-buffer carve
- displayPodName concat     folded into composite buffer
- indexXidsByNode map       pre-sized to len(xids)
- verdicts slice grow       pre-sized to len(events)

Verdict semantics byte-identical: golden replay
(module/pkg/replay/xid_correlation/canonical/golden.json) passes
unchanged. The carved strings alias an immutable backing buffer;
JSON serialization compares content equality, not pointer identity.

Ceiling lowered from 13 -> 2 in scripts/bench-registry.sh; baseline
refreshed in bench/detectors/baselines.json.

Closes #417

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Review: xid_correlation allocs 12.4 → 1.88 (NORTHSTAR hit)

B/A/A+ Criteria

Detector correctness: Verdict output byte-identical (golden replay confirmed via no golden.json changes). Prose strings match original semantics across all 5 fields (EvictedPod, Headline, Remediation, xid-UID, pod-description).

Go perf discipline: unsafe.String aliasing is sound by construction (buffer finalized before carve, carved strings don't outlive heap-allocated backing, no further appends post-carve). Stack-allocated scratch arrays don't escape (copied into composite before next use).

Simplicity: Approach mirrors established nccl_hang pattern (PR #438). Dead functions inlined (xidCorrelationHeadline, xidCorrelationRemediation). Helpers justifiable: intToScratch (3 uses, stack-safety documentation), stringSlice (5 uses, contract clarity), writeInt (3 uses, builder encapsulation).


Unsafe.String Safety Audit

Buffer finalization (CRITICAL): Lines 158-209 fill composite buffer via appends; lines 212-216 carve unsafe.String aliases; no appends after line 209. Safe.

Length pre-computation: podDisplayLength (lines 248-259) mirrors displayPodName (pod_evicted.go:189-196) exactly across all three branches. Capacity pre-sizing correct.

Carved string lifetimes: composite is heap-allocated in buildXidCorrelationVerdict; carved strings escape in returned XidCorrelationVerdict struct. Go runtime keeps heap buffer alive as long as strings reference it (standard semantics).

Self-reference appends safe: Lines 186, 199, 206 read already-appended ranges (composite[podStart:podEnd]) within the same growing buffer. No aliasing to external buffers.

Off-by-one risk: None. lo/hi computed from len() calls within buildXidCorrelationVerdict; indices are local state.

Data race risk: None. Evaluate() is single-threaded; inputs are read-only snapshots; no goroutine spawning.

Go version: go.mod specifies 1.26.3; unsafe.String available since 1.20. ✓


Simplification Sweep: Clean

Deleted two standalone functions (xidCorrelationHeadline, xidCorrelationRemediation) inlined into composite-buffer build. Added helpers justified at ≥3 uses except writeInt (3 uses, at threshold but improves builder readability + encapsulates stack pattern). No noise comments; all doc strings serve unsafe.String contract or alloc-discipline purpose.


Benchmark Gate

Ratchet 13 → 2 allocs/event:

  • Measured floor: 1.88 (composite buffer + EvidenceTrail backing array)
  • Ceiling 2: NORTHSTAR target, justified
  • Slack budget: 6.4% (conservative vs nccl_hang's 17.6%)
  • Gate is appropriate (hard ceiling prevents creep on NORTHSTAR detector)

Verdict

A+ — Unsafe.String aliasing is sound. Golden replay byte-identical. Follows established perf recipe (composite buffer + strconv builders + pre-sized structures). Benchmark gate sensible. Code quality high.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 01:55
@trilamsr trilamsr merged commit 3fbcc4e into main Jun 2, 2026
23 checks passed
@trilamsr trilamsr deleted the perf/417-xid-correlation-allocs branch June 2, 2026 01:58
trilamsr added a commit that referenced this pull request Jun 2, 2026
Address reviewer A- trim items on #456:
1. displayPodName: drop optimization-impact comment (function inlined
   in pod_evicted; kept exported for xid_correlation use).
2. Evaluate: collapse 9-bullet allocation-changelog into one-line
   NORTHSTAR pointer + per-file note.
3. Consolidate fmt->strconv narrative into one top-of-Evaluate note;
   drop the per-site repetition across five render helpers.
4. containsFold: replace 'equivalent to' note with explicit ASCII-only
   safety guard (anchors + kubelet eviction messages are English).

No behavior change: allocs/op = 1943 unchanged, golden replay
byte-identical, bench-allocs-check still PASS.

Verified finding 5: scripts/bench-registry.sh xid_correlation
ceiling already 2 (post-#448), no change required.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…434) (#456)

## Summary

Closes #434. Profile-driven optimization of
`PodEvictedDetector.Evaluate` drops per-event allocations from **15.27 →
1.90**, clearing the **NORTHSTAR (2)** floor. Bench ceiling lowered 16 →
2 in `scripts/bench-registry.sh::allocs_gate`;
`bench/detectors/baselines.json` updated. Same recipe as #438
(nccl_hang) and #448 (xid_correlation).

## Before / after

`go test -bench=BenchmarkPodEvictedDetector -benchmem -count=10
-benchtime=500ms ./bench/detectors/` (Apple M1 Max):

| metric | before | after | delta |
|---|---|---|---|
| allocs/op | 15635 | 1943 | **-87.6%** |
| B/op | 970020 | 473400 | -51.2% |
| allocs/event | 15.27 | **1.90** | hit NORTHSTAR |

Stability: 10 consecutive bench runs all measured 1943 allocs/op (CV =
0%).

## Root cause (named)

Per-evicted-pod the previous detector allocated ~19 objects. Profile
(`-memprofile`, `-alloc_objects`) attributed them:

1. **`fmt.Sprintf` ×5 in `buildVerdict` + helpers** — 60% of pre-fix
allocs. Sprintf boxes each arg into an `any` slice + scans the format
string into a fresh internal buffer; both escape. Replaced with
`strconv.AppendInt` + manual byte builders.
2. **`strings.ToLower(note)` in `pressureFromNote`** — 1 alloc per
evicted pod (the lowercased copy). Replaced with a zero-alloc
`containsFold` ASCII-fold scan (all anchors are already lowercase).
3. **`time.Format` in `formatTimestamp`** — Sprintf-style alloc per
headline. Replaced with `t.AppendFormat(buf, time.RFC3339)` writing into
the shared scratch buffer.
4. **Unsized `condIdx` map + per-bucket `append` growth** —
`make(map[K]V)` defaults to 8 buckets; `append(nil, r)` grows each
bucket. Pre-sized to `len(recs)`.
5. **Per-verdict `make([]EvidenceRef, 1, 2)`** — 1 alloc per verdict for
the trail slice header. Replaced with ONE contiguous `trailBacking :=
make([]EvidenceRef, 2*len(events))` sliced cap=2 per verdict.
6. **Escaping `make([]byte, dynamic-cap)` per builder call** — escape
analysis can't prove the dynamic capacity fits on stack, so each builder
allocated TWO objects (buf + the `string(buf)` cast). Replaced with a
per-Evaluate `scratch *[]byte` reused across every builder (each resets
to `(*scratch)[:0]` and appends), collapsing to ONE alloc per call (the
irreducible string cast).
7. **Redundant per-condition re-render** — 820 evictions joining onto 64
conditions re-rendered the same UID + description + remediation strings
on every join (2460 redundant allocs). Pre-rendered once at index time
into a new `indexedNodeCond` struct.
8. **Partial-path remediation re-render** — partial-path evictions (no
joined condition) re-rendered the `"On node X: <prose>"` remediation per
pod. Added a `(node, pressure) → string` cache so burst evictions on the
same node share one alloc.
9. **`sort.SliceStable` reflection** — both `indexNodeConds`'s
per-bucket sort and the outer verdicts sort used reflection-based
`sort.SliceStable` (allocates a swapper per call). Replaced with
`slices.SortStableFunc` (generics, no reflection); buckets of length ≤1
skip the sort call entirely.

## Top profile entries (memprofile -alloc_objects)

**Before:**
```
60% buildVerdict (5 fmt.Sprintf calls per eviction)
22% fmt.Sprintf
 6% nodeConditionDescription
 4% annotateRemediationWithNode
 3% time.Time.Format
 4% strings.ToLower
```

**After:**
```
42% podEventDescription   (1 alloc/call — irreducible string buf cast)
36% renderHeadline        (1 alloc/call — irreducible string buf cast)
 7% annotateRemediationWithNode  (partial-path cache misses only)
 5% nodeConditionDescription     (index-time pre-render, 64 calls total)
 3% nodeConditionUID             (index-time pre-render, 64 calls total)
```

The two per-verdict string allocations (headline + pod_event
description) are now the irreducible floor: each is a distinct verdict
field that downstream `json.Marshal` requires as separate strings.

## Verdict semantics — unchanged

Byte-identical headlines, remediation, UIDs, descriptions, JSON
envelope. Confirmed via:

- `module/pkg/replay/pod_evicted/canonical/golden.json` round-trips
unchanged.
- All `TestPodEvicted*` unit tests pass (negative fixtures,
deterministic order, partial path, out-of-window, future transition
excluded, empty node message, remediation pins node name, schema
conformance + drift rejection).
- `TestPodEvictedVerdict_SchemaConformance` (canonical full + partial)
passes.

## Gate state

| gate | result |
|---|---|
| `scripts/bench-allocs-check.sh` (hard absolute ceiling) | PASS at new
ceiling=2 |
| `scripts/bench-check-detectors.sh` (10% delta soft gate) | PASS (no
detector regressed) |
| `go test -race ./module/pkg/patterns/ ./module/pkg/replay/
./module/processor/...` | PASS |
| `go vet ./module/pkg/patterns/` | clean |
| pre-commit (golangci-lint, vet, mod verify, attribute-namespace-check)
| PASS |

```release-notes
pod_evicted detector now allocates 1.90 allocations per event (down from 15.27), hitting the v0.3.0 NORTHSTAR floor of <=2 allocs/event. Joins #418 (nccl_hang) and #417 (xid_correlation) — three of the six per-detector tracking issues at or below NORTHSTAR.
```

Closes #434
Refs #302

## Test plan

- [x] `go test -race -count=1 ./module/pkg/patterns/
./module/pkg/replay/ ./module/processor/...` PASS
- [x] `go vet ./module/pkg/patterns/` clean
- [x] `bash scripts/bench-allocs-check.sh` PASS (pod_evicted 1.90/ev <=
ceiling 2; all 6 detectors at-or-below ceiling)
- [x] `bash scripts/bench-check-detectors.sh` PASS (no detector
regressed >10%)
- [x] `go test -bench=BenchmarkPodEvictedDetector -count=10
-benchtime=500ms` stable at 1943 allocs/op (CV=0%)
- [x] Golden replay
(`module/pkg/replay/pod_evicted/canonical/golden.json`) round-trips
unchanged
- [x] Rebased onto origin/main (resolved scripts/bench-registry.sh
conflict with #448's xid_correlation ratchet — both ratchets shipped
together in this branch's view)

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

`scripts/bench-registry.sh::bench_entries` covered only
`BenchmarkPodEvictedDetector` and the pyspy parser benches, so the **%
delta regression gate** (`scripts/bench-check.sh` via
`bench-check-all.sh`) missed five of the six detectors. The absolute
allocs/event ceiling (`allocs_gate`) caught a detector that drifted past
its hard ceiling, but a detector that drifted +50% while staying under
the ceiling would ship silently. This PR closes the gap by registering
all six detectors against both gates.

## Root cause

The registry was authored when `pod_evicted` was the only detector with
a bench. The five later detectors (added in
#417/#418/#434/#438/#448/#456) were plumbed into `allocs_gate` (absolute
ceiling) but the parallel % delta gate was never extended — a historical
artifact, not an architectural defect. Fix: extend `bench_entries` to
cover the missing five.

## What changed

- `scripts/bench-registry.sh`: added one entry covering five detectors
against `./bench/detectors/`, single-regex / single `go test` invocation
(~5s incremental vs ~30s if split into 5 entries).
- `bench/detectors/testdata/bench-baseline.txt`: generated baseline
(count=10 × 500ms, Apple M1 Max). Hardware-invariant signals (B/op +
allocs/op) pin the gate; sec/op stays advisory.
- Existing ceilings in `allocs_gate` unchanged. Existing baselines
unchanged.

## Mutation verification

Bumped `PCIeAERDetector` allocs baseline down by ~18% (524 → 430):

```
PCIeAERDetector-10  430.0 ± 0%  524.0 ± 0%  +21.86% (p=0.000 n=10)
REGRESSION: the following benchmarks exceeded the 10% threshold vs baseline:
  PCIeAERDetector-10  +21.86% (p=0.000 n=10)
```

`scripts/bench-check-all.sh` exit 1. Revert → exit 0. Gate fires on
regression, stays clean on baseline.

## Coverage gap context

Cross-link: #302 (allocs/event rollup), #417 (xid_correlation
NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR),
#438/#448/#456 (sibling perf PRs). Every detector that hit NORTHSTAR
under the absolute ceiling now also has a relative-regression gate.

## Test plan

- [x] `make bench-check` exit 0 (covers all 6 detectors now)
- [x] `make bench-allocs-check` exit 0
- [x] `make bench-detectors-check` exit 0 (soft-gate, unchanged)
- [x] Mutation: lower baseline → exit 1; revert → exit 0
- [x] Lint + vet (pre-commit hook): 0 issues

```release-notes
ci(bench): % delta regression gate now covers all six detectors (#302). xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, and pcie_aer were previously gated only by the absolute allocs/event ceiling — they now also fail builds on >10% allocs/op or B/op drift vs the committed baseline.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(xid_correlation): lower allocs/event from 12.4 toward NORTHSTAR (2)

1 participant